LightGlue: Local Feature Matching at Light Speed
We introduce LightGlue, a deep neural network that learns to match local
features across images. We revisit multiple design decisions of SuperGlue, the
state of the art in sparse matching, and derive simple but effective
improvements. Cumulatively, they make LightGlue more efficient (in terms of
both memory and computation), more accurate, and much easier to train. One key
property is that LightGlue is adaptive to the difficulty of the problem: the
inference is much faster on image pairs that are intuitively easy to match, for
example because of a larger visual overlap or limited appearance change. This
opens up exciting prospects for deploying deep matchers in latency-sensitive
applications like 3D reconstruction. The code and trained models are publicly
available at https://github.com/cvg/LightGlue.
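The adaptive-depth idea can be pictured as an early-exit loop: refinement layers run only until the match confidences stabilize, so easy pairs finish after fewer layers. The sketch below is illustrative, not the actual LightGlue code; all function names, the toy callables, and the threshold are assumptions.

```python
def match_adaptive(scores, refine, confidence, max_layers=9, exit_thresh=0.95):
    """Refine a matching-score vector layer by layer; stop as soon as the
    fraction of confident matches reaches `exit_thresh` (early exit)."""
    for layer in range(max_layers):
        scores = refine(scores, layer)           # one attention/refinement step
        if confidence(scores) >= exit_thresh:    # easy pair: stop early
            return scores, layer + 1             # layers actually executed
    return scores, max_layers                    # hard pair: full depth

# Toy stand-ins: each "layer" sharpens the scores; confidence is the
# fraction of scores above 0.9.
toy_refine = lambda s, layer: [min(1.0, v + 0.2) for v in s]
toy_conf = lambda s: sum(v > 0.9 for v in s) / len(s)

scores, depth = match_adaptive([0.5, 0.5], toy_refine, toy_conf, exit_thresh=0.9)
```

With harder toy inputs (lower starting scores) the loop runs more layers, mirroring the paper's claim that inference cost tracks pair difficulty.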
Leveraging Deep Visual Descriptors for Hierarchical Efficient Localization
Many robotics applications require precise pose estimates despite operating
in large and changing environments. This can be addressed by visual
localization, using a pre-computed 3D model of the surroundings. The pose
estimation then amounts to finding correspondences between 2D keypoints in a
query image and 3D points in the model using local descriptors. However,
computational power is often limited on robotic platforms, making this task
challenging in large-scale environments. Binary feature descriptors
significantly speed up this 2D-3D matching, and have become popular in the
robotics community, but also strongly impair the robustness to perceptual
aliasing and changes in viewpoint, illumination and scene structure. In this
work, we propose to leverage recent advances in deep learning to perform an
efficient hierarchical localization. We first localize at the map level using
learned image-wide global descriptors, and subsequently estimate a precise pose
from 2D-3D matches computed in the candidate places only. This restricts the
local search and thus allows us to efficiently exploit powerful non-binary
descriptors usually dismissed on resource-constrained devices. Our approach
results in state-of-the-art localization performance while running in real-time
on a popular mobile platform, enabling new prospects for robotics research.
Comment: CoRL 2018 camera-ready (fixed typos and updated citations).
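A minimal sketch of this two-stage scheme (all data structures and helper names are assumptions, not the paper's API): global descriptors prune the search to a few candidate places, and expensive 2D-3D local matching runs only inside those candidates.

```python
def hierarchical_localize(query_global, place_globals, match_2d3d, k=3):
    """Return (place_id, inlier_count) for the best candidate place.
    `place_globals` maps place_id -> unit-norm global descriptor;
    `match_2d3d(place_id)` runs local 2D-3D matching, returning inliers."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    # Coarse step: rank places by global-descriptor similarity, keep top-k.
    ranked = sorted(place_globals,
                    key=lambda p: dot(query_global, place_globals[p]),
                    reverse=True)[:k]
    # Fine step: run the expensive local matcher only on the candidates.
    return max(((p, match_2d3d(p)) for p in ranked), key=lambda t: t[1])

places = {"cafe": [1.0, 0.0], "park": [0.0, 1.0], "lab": [0.9, 0.1]}
inliers = {"cafe": 120, "park": 300, "lab": 80}  # toy local-matching results
best = hierarchical_localize([1.0, 0.0], places, inliers.get, k=2)
```

Note that "park" is never matched locally because retrieval prunes it first; this restriction is what makes the heavy non-binary descriptors affordable.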
From Coarse to Fine: Robust Hierarchical Localization at Large Scale
Robust and accurate visual localization is a fundamental capability for
numerous applications, such as autonomous driving, mobile robotics, or
augmented reality. It remains, however, a challenging task, particularly for
large-scale environments and in presence of significant appearance changes.
State-of-the-art methods not only struggle with such scenarios, but are often
too resource intensive for certain real-time applications. In this paper we
propose HF-Net, a hierarchical localization approach based on a monolithic CNN
that simultaneously predicts local features and global descriptors for accurate
6-DoF localization. We exploit the coarse-to-fine localization paradigm: we
first perform a global retrieval to obtain location hypotheses and only later
match local features within those candidate places. This hierarchical approach
incurs significant runtime savings and makes our system suitable for real-time
operation. By leveraging learned descriptors, our method achieves remarkable
localization robustness across large variations of appearance and sets a new
state-of-the-art on two challenging benchmarks for large-scale localization.
Comment: Camera-ready for CVPR 2019.
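The "monolithic CNN" structure can be sketched as one shared backbone feeding two heads, one for a global image descriptor (coarse retrieval) and one for local features (fine matching), so both outputs cost a single forward pass. The class and head names below are illustrative, not HF-Net's real interface.

```python
class MonolithicLocalizerSketch:
    def __init__(self, backbone, global_head, local_head):
        self.backbone = backbone
        self.global_head = global_head
        self.local_head = local_head

    def __call__(self, image):
        feats = self.backbone(image)   # shared computation, done once
        return {
            "global_descriptor": self.global_head(feats),  # for retrieval
            "local_features": self.local_head(feats),      # for 6-DoF matching
        }

net = MonolithicLocalizerSketch(
    backbone=lambda img: [v * 2 for v in img],  # toy stand-in for a CNN
    global_head=lambda f: sum(f),               # toy global pooling
    local_head=lambda f: list(enumerate(f)),    # toy (keypoint, descriptor) pairs
)
out = net([1, 2, 3])
```

Sharing the backbone is what yields the runtime savings: retrieval and matching reuse the same features instead of running two separate networks.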
AffineGlue: Joint Matching and Robust Estimation
We propose AffineGlue, a method for joint two-view feature matching and
robust estimation that reduces the combinatorial complexity of the problem by
employing single-point minimal solvers. AffineGlue selects potential matches
from one-to-many correspondences to estimate minimal models. Guided matching is
then used to find matches consistent with the model, suffering less from the
ambiguities of one-to-one matches. Moreover, we derive a new minimal solver for
homography estimation, requiring only a single affine correspondence (AC) and a
gravity prior. Furthermore, we train a neural network to reject ACs that are
unlikely to lead to a good model. AffineGlue is superior to the SOTA on
real-world datasets, even when assuming that the gravity direction points
downwards. On PhotoTourism, the AUC@10° score is improved by 6.6 points
compared to the SOTA. On ScanNet, AffineGlue makes SuperPoint and SuperGlue
achieve accuracy similar to that of the detector-free LoFTR.
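The single-correspondence idea can be pictured as a RANSAC-style loop in which each hypothesis is estimated from one candidate drawn from the one-to-many matches, keeping the combinatorics linear in the number of candidates. The solver and scorer below are toy stand-ins, not AffineGlue's.

```python
import random

def single_ac_ransac(candidates, solve_from_ac, score_model, iters=50, seed=0):
    """Each hypothesis comes from ONE candidate correspondence; the best
    model under the (guided-matching-style) score is kept."""
    rng = random.Random(seed)
    best_model, best_score = None, float("-inf")
    for _ in range(iters):
        ac = rng.choice(candidates)    # sample a single correspondence
        model = solve_from_ac(ac)      # single-point minimal solver
        score = score_model(model)     # e.g. inliers after guided matching
        if score > best_score:
            best_model, best_score = model, score
    return best_model, best_score

# Toy problem: the "model" is just the correspondence itself, and the
# score peaks at 3.0.
model, score = single_ac_ransac(
    candidates=[1.0, 2.0, 3.0, 4.0],
    solve_from_ac=lambda ac: ac,
    score_model=lambda m: -abs(m - 3.0),
)
```

Because each hypothesis needs only one correspondence, far fewer samples are required than for multi-point minimal solvers.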
The Fishyscapes Benchmark: Measuring Blind Spots in Semantic Segmentation
Deep learning has enabled impressive progress in the accuracy of semantic segmentation. Yet, the ability to estimate uncertainty and detect failure is key for safety-critical applications like autonomous driving. Existing uncertainty estimates have mostly been evaluated on simple tasks, and it is unclear whether these methods generalize to more complex scenarios. We present Fishyscapes, the first public benchmark for anomaly detection in a real-world task of semantic segmentation for urban driving. It evaluates pixel-wise uncertainty estimates towards the detection of anomalous objects. We adapt state-of-the-art methods to recent semantic segmentation models and compare uncertainty estimation approaches based on softmax confidence, Bayesian learning, density estimation, image resynthesis, as well as supervised anomaly detection methods. Our results show that anomaly detection is far from solved even for ordinary situations, while our benchmark allows measuring advancements beyond the state-of-the-art. Results, data and submission information can be found at https://fishyscapes.com/
ISSN: 0920-5691
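The simplest baseline the benchmark covers, softmax confidence, amounts to a per-pixel anomaly score of one minus the maximum softmax probability: uniform (uncertain) predictions score high, confident ones near zero. A minimal sketch (function names are ours):

```python
import math

def softmax(logits):
    m = max(logits)                     # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def anomaly_score(pixel_logits):
    """Per-pixel anomaly score in [0, 1 - 1/num_classes]: one minus the
    maximum softmax probability over the class logits."""
    return 1.0 - max(softmax(pixel_logits))
```

Applied over a whole segmentation map, this yields the pixel-wise uncertainty estimates that the benchmark compares against stronger approaches such as density estimation or resynthesis.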
Back to the Feature: Learning Robust Camera Localization from Pixels to Pose
Camera pose estimation in known scenes is a 3D geometry task recently tackled by multiple learning algorithms. Many regress precise geometric quantities, like poses or 3D points, from an input image. This either fails to generalize to new viewpoints or ties the model parameters to a specific scene. In this paper, we go Back to the Feature: we argue that deep networks should focus on learning robust and invariant visual features, while the geometric estimation should be left to principled algorithms. We introduce PixLoc, a scene-agnostic neural network that estimates an accurate 6-DoF pose from an image and a 3D model. Our approach is based on the direct alignment of multiscale deep features, casting camera localization as metric learning. PixLoc learns strong data priors by end-to-end training from pixels to pose and exhibits exceptional generalization to new scenes by separating model parameters and scene geometry. The system can localize in large environments given coarse pose priors but also improve the accuracy of sparse feature matching by jointly refining keypoints and poses with little overhead. The code will be publicly available at github.com/cvg/pixloc
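One way to picture direct alignment is a damped Gauss-Newton loop that nudges the pose to shrink the feature residual. Everything below is a 1-D toy (scalar pose, hand-written residual and Jacobian), a sketch of the principle rather than PixLoc's optimizer.

```python
def refine_pose(pose, residual, jacobian, steps=20, damping=1e-3):
    """Levenberg-Marquardt-style updates on a scalar pose parameter:
    repeatedly step against the residual, damped for stability."""
    for _ in range(steps):
        r = residual(pose)                  # feature difference at current pose
        J = jacobian(pose)                  # sensitivity of residual to pose
        pose -= J * r / (J * J + damping)   # damped Gauss-Newton step
    return pose

# Toy residual: the features align exactly at pose = 2.0.
refined = refine_pose(10.0, residual=lambda p: p - 2.0, jacobian=lambda p: 1.0)
```

Because the update only needs residuals and Jacobians of the features, the same principled solver works in any scene, which is the separation of learned features from geometric estimation that the paper argues for.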